Performance Improvement for Frequent Term-based Text Clustering Algorithm

Author

Zhongmin Shi

Abstract

Frequent term-based text clustering [2] is a recently introduced text clustering technique that uses frequent term sets to dramatically decrease the dimensionality of the document vector space. It thereby addresses two central problems of text clustering: the very high dimensionality of the data and the very large size of the databases [2]. Moreover, frequent term sets provide understandable descriptions of the clusters. The frequent term-based text clustering algorithm (FTC) has shown significant efficiency compared with some well-known text clustering methods [2], but the quality of its clustering still needs improvement. This report points out the problems with the overlap calculations in FTC and introduces refined algorithms that aim to improve both the running time and the cluster quality. The performance of FTC before and after the improvements is compared in experiments on classical text documents as well as on web documents. Finally, an evaluation of the clustering procedure may provide a clue for further work.

General Terms: Algorithms, Performance, Experimentation.
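As a rough illustration of the idea of frequent term sets and their covers, the following minimal Python sketch enumerates term sets that occur in at least a given fraction of documents and records which documents contain them. It is not the FTC algorithm or the improved algorithms of this report; the function name `frequent_term_sets`, the `min_support` and `max_size` parameters, the brute-force enumeration, and the toy documents are all assumptions made for the example.

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover (indices of supporting docs).

    Hypothetical helper: brute-force enumeration, suitable only for toy input.
    """
    n = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*doc_terms))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            # The cover is the set of documents containing every term in combo.
            cover = frozenset(
                i for i, terms in enumerate(doc_terms) if terms.issuperset(combo)
            )
            if len(cover) / n >= min_support:
                result[frozenset(combo)] = cover
    return result

# Toy corpus (invented for illustration).
docs = [
    "text clustering with frequent terms",
    "frequent terms reduce dimensionality",
    "web documents clustering",
    "clustering text documents",
]
sets = frequent_term_sets(docs, min_support=0.5)

for term_set, cover in sorted(sets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(term_set), "->", sorted(cover))
```

Each cover is a candidate cluster whose label is the term set itself, which is where the understandable cluster descriptions come from. A document can appear in several covers, and that is the overlap a clustering algorithm built on these candidates has to resolve.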


Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Effective Dimension Reduction Techniques for Text Documents

Frequent term-based text clustering is a text clustering technique that uses frequent term sets and dramatically decreases the dimensionality of the document vector space, thus addressing two problems in particular: the very high dimensionality of the data and the very large size of the databases. The Frequent Term-based Clustering algorithm (FTC) has shown significant efficiency compared with some well known text cluster...


Effective Term Based Text Clustering Algorithms

Text clustering methods can be used to group large sets of text documents. Most of the text clustering methods do not address the problems of text clustering such as very high dimensionality of the data and understandability of the clustering descriptions. In this paper, a frequent term based approach of clustering has been introduced; it provides a natural way of reducing a large dimensionalit...


Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach

The vast amount of textual information available in electronic form is growing at a staggering rate in recent times. The task of mining useful or interesting frequent itemsets (words/terms) from very large text databases that are formed as a result of the increasing number of textual data still seems to be a quite challenging task. A great deal of attention in research community has been receiv...


Investigate the Performance of Document Clustering Approach Based on Association Rules Mining

The challenges of the standard clustering methods and the weaknesses of Apriori algorithm in frequent termset clustering formulate the goal of our research. Based on Association Rules mining, an efficient approach for Web Document Clustering (ARWDC) has been devised. An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) has been used to improve the efficiency of mining association...



Publication date: 2004
